We will examine the data sets provided by the Human Activity Recognition (HAR) research project (http://groupware.les.inf.puc-rio.br/har) to classify how well a participant performs a specific activity, in this case lifting a dumbbell. This research is interesting in that it differs from prevalent HAR work, which classifies what activity a subject performs, rather than how well the subject performs a known activity. Data are collected from three types of sensors (gyroscope, accelerometer, and magnetometer) mounted at four locations: the subject's arm, belt, and forearm, plus the dumbbell itself. The provided training set (pml-training.csv) contains 19,622 observations from 6 human subjects, with 157 sensor measurements plus the subject's name, the classification of how well the activity was performed (5 classes), and an index, for a total of 160 columns. We are also given a testing set (pml-testing.csv) of 20 observations with the same 160 columns, except that the last column is "problem_id" instead of "classe". We are to apply the machine learning model developed in this project to the testing set, predict the performance class for each observation, and submit the predictions in 20 separate files.
We will first look at a few representative data points to get a feel for the predictive capability each might have. We will remove or ignore sensor columns containing many NAs. We will then fit a Random Forest model via the caret package to see how well it works. As a comparative analysis of algorithm performance, we will also apply a Generalized Boosted Regression Model (GBM) and a Support Vector Machine (SVM).
We expect the out-of-sample error to be under 1%, i.e., accuracy above 99%. With the best-performing model, validation accuracy exceeded 99.4% and the final prediction on out-of-sample data achieved 100%. Detailed error estimates appear in the output of caret's confusionMatrix function for each model investigated.
We first load the data sets:
#Clear working space in RStudio
rm(list = ls(all = TRUE))
#load the caret package
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
#reading the provided training data set and final testing set
originalTraining <- read.csv("pml-training.csv",header=T)
originalTesting <- read.csv("pml-testing.csv",header=T)
#Check number of levels in the factor variable classe
(levels(originalTraining$classe))
## [1] "A" "B" "C" "D" "E"
#capture the outcome we are to predict in a separate variable
classCol <- originalTraining$classe
#capture number of human subjects, users, in a variable
(users <- levels(originalTraining$user_name))
## [1] "adelmo" "carlitos" "charles" "eurico" "jeremy" "pedro"
Let's take a look at a few sensory data points:
library(ggplot2)
ggplot(originalTraining, aes(x = user_name, y = accel_belt_x, color = classe)) + geom_point(position = position_jitter(width = .5), alpha = .3)
ggplot(originalTraining, aes(x = user_name, y = yaw_arm, color = classe)) + geom_point(position = position_jitter(width = .5), alpha = .3)
ggplot(originalTraining, aes(x = user_name, y = accel_arm_x, color = classe)) + geom_point(position = position_jitter(width = .5), alpha = .3)
ggplot(originalTraining, aes(x = user_name, y = gyros_arm_x, color = classe)) + geom_point(position = position_jitter(width = .5), alpha = .3)
We will focus on the predictors representing sensors, directions, angles, and locations, and ignore predictors consisting mostly of NAs.
#Columns used for prediction
sensors <- c("gyros","accel","magnet")
directions <- c("x","y","z")
angles <- c("roll","pitch","yaw")
locations <- c("belt","arm","dumbbell","forearm")
#Isolate all predictors with permutations of sensors, directions, and locations
XYZs <- sort( apply( X = expand.grid(sensors,locations,directions) , MARGIN = 1, FUN = function(s) paste(s,collapse="_") ) )
RPYs <- sort( apply( X = expand.grid(angles,locations) , MARGIN = 1, FUN = function(s) paste(s,collapse="_") ) )
(inCols <- c("user_name", XYZs, RPYs, "classe"))
## [1] "user_name" "accel_arm_x" "accel_arm_y"
## [4] "accel_arm_z" "accel_belt_x" "accel_belt_y"
## [7] "accel_belt_z" "accel_dumbbell_x" "accel_dumbbell_y"
## [10] "accel_dumbbell_z" "accel_forearm_x" "accel_forearm_y"
## [13] "accel_forearm_z" "gyros_arm_x" "gyros_arm_y"
## [16] "gyros_arm_z" "gyros_belt_x" "gyros_belt_y"
## [19] "gyros_belt_z" "gyros_dumbbell_x" "gyros_dumbbell_y"
## [22] "gyros_dumbbell_z" "gyros_forearm_x" "gyros_forearm_y"
## [25] "gyros_forearm_z" "magnet_arm_x" "magnet_arm_y"
## [28] "magnet_arm_z" "magnet_belt_x" "magnet_belt_y"
## [31] "magnet_belt_z" "magnet_dumbbell_x" "magnet_dumbbell_y"
## [34] "magnet_dumbbell_z" "magnet_forearm_x" "magnet_forearm_y"
## [37] "magnet_forearm_z" "pitch_arm" "pitch_belt"
## [40] "pitch_dumbbell" "pitch_forearm" "roll_arm"
## [43] "roll_belt" "roll_dumbbell" "roll_forearm"
## [46] "yaw_arm" "yaw_belt" "yaw_dumbbell"
## [49] "yaw_forearm" "classe"
inTraining <- as.data.frame(originalTraining[, inCols])
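The name-based column selection above sidesteps the many summary columns that are mostly NA. An equivalent data-driven filter could drop columns by their NA share directly; a minimal base-R sketch, where the toy data frame and the cutoff value are illustrative assumptions rather than values from this analysis:

```r
# Sketch: drop columns whose share of NAs exceeds a threshold.
# The cutoff and the toy data frame below are illustrative assumptions.
drop_high_na_cols <- function(df, threshold = 0.95) {
  na_share <- colMeans(is.na(df))
  df[, na_share <= threshold, drop = FALSE]
}

# Toy example: 'max_roll_belt' mimics a mostly-NA summary column.
toy <- data.frame(
  roll_belt     = c(1.4, 1.5, 1.6, 1.4),
  max_roll_belt = c(NA,  NA,  NA,  2.0)
)
names(drop_high_na_cols(toy, threshold = 0.5))  # "roll_belt"
```

Applied to the full training set, such a filter would retain essentially the same raw-measurement columns selected by name above.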
We now have a training set with 50 columns, which we will split into training, validation, and testing partitions using caret, in 60%, 20%, and 20% proportions respectively.
set.seed(12345)
indexTrain <- createDataPartition(y = inTraining$classe, p = 0.6, list = FALSE)
trainingSet <- inTraining[indexTrain, ]
restT <- inTraining[-indexTrain, ]
indexV <- createDataPartition(y = restT$classe, p = 0.5, list = FALSE)
validationSet <- restT[indexV, ]
testingSet <- restT[-indexV, ]
dim(trainingSet); dim(validationSet); dim(testingSet);
## [1] 11776 50
## [1] 3923 50
## [1] 3923 50
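Note that caret's train() resamples with the bootstrap by default; explicit k-fold cross-validation can be requested through trainControl. A configuration sketch, where the fold count of 5 is an assumed choice and not the setting used in the fits below:

```r
# Sketch: explicit 5-fold cross-validation via caret's trainControl.
# The fold count is an illustrative assumption; the models in this
# report were fit with train()'s default bootstrap resampling.
library(caret)
ctrl <- trainControl(method = "cv", number = 5)
# fitRF <- train(classe ~ ., method = "rf", data = trainingSet, trControl = ctrl)
```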
We proceed to fit a model using Random Forest via the caret package and use the validation set to see how well it performs:
set.seed(122333)
fitRF <- train(classe~.,method="rf",data=trainingSet)
## Loading required package: randomForest
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
## Loading required package: class
predictValRF <- predict(fitRF,validationSet)
confusionMatrix(predictValRF,validationSet$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1116 4 0 0 0
## B 0 752 1 1 1
## C 0 3 683 11 1
## D 0 0 0 631 2
## E 0 0 0 0 717
##
## Overall Statistics
##
## Accuracy : 0.994
## 95% CI : (0.991, 0.996)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.992
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.000 0.991 0.999 0.981 0.994
## Specificity 0.999 0.999 0.995 0.999 1.000
## Pos Pred Value 0.996 0.996 0.979 0.997 1.000
## Neg Pred Value 1.000 0.998 1.000 0.996 0.999
## Prevalence 0.284 0.193 0.174 0.164 0.184
## Detection Rate 0.284 0.192 0.174 0.161 0.183
## Detection Prevalence 0.285 0.192 0.178 0.161 0.183
## Balanced Accuracy 0.999 0.995 0.997 0.990 0.997
We achieved a prediction accuracy of 99.4% on the validation set. We would expect the out-of-sample error when predicting on the testing set to be close to what we achieved with the validation set:
predictRF <- predict(fitRF,testingSet)
confusionMatrix(predictRF,testingSet$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1112 6 0 0 0
## B 4 751 6 0 0
## C 0 2 675 6 2
## D 0 0 3 634 2
## E 0 0 0 3 717
##
## Overall Statistics
##
## Accuracy : 0.991
## 95% CI : (0.988, 0.994)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.989
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.996 0.989 0.987 0.986 0.994
## Specificity 0.998 0.997 0.997 0.998 0.999
## Pos Pred Value 0.995 0.987 0.985 0.992 0.996
## Neg Pred Value 0.999 0.997 0.997 0.997 0.999
## Prevalence 0.284 0.193 0.174 0.164 0.184
## Detection Rate 0.283 0.191 0.172 0.162 0.183
## Detection Prevalence 0.285 0.194 0.175 0.163 0.184
## Balanced Accuracy 0.997 0.993 0.992 0.992 0.997
The performance on the out-of-sample testing set is slightly below that on the validation set, with an accuracy of 99.1%. Certainly, we could strive for something close to 100%, but would risk over-fitting. Nevertheless, as a comparative analysis, we will proceed to apply a Support Vector Machine (SVM) with a radial basis kernel and a Generalized Boosted Regression Model (GBM).
set.seed(222333)
fitSVM <- train(classe~.,method="svmRadial",data=trainingSet)
predictValSVM <- predict(fitSVM,validationSet)
confusionMatrix(predictValSVM,validationSet$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1106 64 6 3 1
## B 3 653 31 6 9
## C 4 36 633 79 39
## D 2 1 14 552 22
## E 1 5 0 3 650
##
## Overall Statistics
##
## Accuracy : 0.916
## 95% CI : (0.907, 0.925)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.894
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.991 0.860 0.925 0.858 0.902
## Specificity 0.974 0.985 0.951 0.988 0.997
## Pos Pred Value 0.937 0.930 0.800 0.934 0.986
## Neg Pred Value 0.996 0.967 0.984 0.973 0.978
## Prevalence 0.284 0.193 0.174 0.164 0.184
## Detection Rate 0.282 0.166 0.161 0.141 0.166
## Detection Prevalence 0.301 0.179 0.202 0.151 0.168
## Balanced Accuracy 0.982 0.922 0.938 0.923 0.949
predictSVM <- predict(fitSVM,testingSet)
confusionMatrix(predictSVM,testingSet$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1103 69 2 3 2
## B 5 658 43 7 9
## C 5 30 616 69 18
## D 1 2 23 561 19
## E 2 0 0 3 673
##
## Overall Statistics
##
## Accuracy : 0.92
## 95% CI : (0.912, 0.929)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.899
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.988 0.867 0.901 0.872 0.933
## Specificity 0.973 0.980 0.962 0.986 0.998
## Pos Pred Value 0.936 0.911 0.835 0.926 0.993
## Neg Pred Value 0.995 0.968 0.979 0.975 0.985
## Prevalence 0.284 0.193 0.174 0.164 0.184
## Detection Rate 0.281 0.168 0.157 0.143 0.172
## Detection Prevalence 0.301 0.184 0.188 0.154 0.173
## Balanced Accuracy 0.981 0.923 0.931 0.929 0.966
SVM produced an accuracy of 91.6% on the validation set and 92.0% on the testing set, underperforming Random Forest.
set.seed(322333)
fitGBM <- train(classe~.,method="gbm", data=trainingSet, verbose = FALSE)
## Loading required package: gbm
## Loading required package: survival
## Loading required package: splines
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loading required package: parallel
## Loaded gbm 2.1
## Loading required package: plyr
predictValGBM <- predict(fitGBM,validationSet)
confusionMatrix(predictValGBM,validationSet$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1102 20 0 2 0
## B 9 710 11 1 9
## C 4 23 667 25 7
## D 1 5 5 611 10
## E 0 1 1 4 695
##
## Overall Statistics
##
## Accuracy : 0.965
## 95% CI : (0.959, 0.97)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.955
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.987 0.935 0.975 0.950 0.964
## Specificity 0.992 0.991 0.982 0.994 0.998
## Pos Pred Value 0.980 0.959 0.919 0.967 0.991
## Neg Pred Value 0.995 0.985 0.995 0.990 0.992
## Prevalence 0.284 0.193 0.174 0.164 0.184
## Detection Rate 0.281 0.181 0.170 0.156 0.177
## Detection Prevalence 0.287 0.189 0.185 0.161 0.179
## Balanced Accuracy 0.990 0.963 0.978 0.972 0.981
predictGBM <- predict(fitGBM,testingSet)
confusionMatrix(predictGBM,testingSet$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1096 27 0 1 0
## B 14 713 23 4 13
## C 4 18 652 21 5
## D 2 1 9 610 13
## E 0 0 0 7 690
##
## Overall Statistics
##
## Accuracy : 0.959
## 95% CI : (0.952, 0.965)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.948
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.982 0.939 0.953 0.949 0.957
## Specificity 0.990 0.983 0.985 0.992 0.998
## Pos Pred Value 0.975 0.930 0.931 0.961 0.990
## Neg Pred Value 0.993 0.985 0.990 0.990 0.990
## Prevalence 0.284 0.193 0.174 0.164 0.184
## Detection Rate 0.279 0.182 0.166 0.155 0.176
## Detection Prevalence 0.287 0.196 0.178 0.162 0.178
## Balanced Accuracy 0.986 0.961 0.969 0.971 0.977
GBM produced an accuracy of 96.5% on the validation set and 95.9% on the testing set.
Random Forest outperformed both the GBM and SVM models for the specific set of predictors we chose. Their relative performance might differ with a different set of predictors.
As such, to predict the outcome (classe) in the given testing set, we will use the fitted Random Forest model.
inColsTest <- c("user_name", XYZs, RPYs)
inTesting <- as.data.frame(originalTesting[, inColsTest])
predictTestRF<-predict(fitRF,inTesting)
(answers<-as.character(predictTestRF))
## [1] "B" "A" "B" "A" "A" "E" "D" "B" "A" "A" "B" "C" "B" "A" "E" "E" "A"
## [18] "B" "B" "B"
Finally, we will write out the twenty predictions to individual files:
pml_write_files <- function(x) {
    n <- length(x)
    for (i in 1:n) {
        filename <- paste0("problem_id_", i, ".txt")
        write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
    }
}
pml_write_files(answers)
The predictions on the provided out-of-sample test set achieved 100% accuracy after the answers were submitted.
We used sensor data collected from six human subjects while lifting dumbbells to predict how well they performed the lift. Specifically, we chose only the 49 relevant data points on sensors, directions, angles, and locations, in addition to the subjects' names, as predictors in the models. Among the machine learning algorithms we applied, Random Forest outperformed Support Vector Machine and the Generalized Boosted Regression Model. We could have tried a model-ensemble approach to see if predictive power could be increased further, but with an accuracy of over 99% achieved by Random Forest, we reasoned that chasing additional accuracy might risk over-fitting and thus lose the power of generalization on other out-of-sample data sets.
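The ensemble idea mentioned above could be as simple as a majority vote across the three classifiers' predicted labels. A base-R sketch, in which the prediction vectors are toy stand-ins for the actual model outputs and the tie-breaking policy is an assumption (this report did not actually build an ensemble):

```r
# Sketch: majority vote across three classifiers' predicted labels.
# Ties fall back to the first model's prediction (an assumed policy).
majority_vote <- function(p1, p2, p3) {
  mapply(function(a, b, c) {
    votes <- table(c(a, b, c))
    if (max(votes) >= 2) names(which.max(votes)) else a
  }, p1, p2, p3, USE.NAMES = FALSE)
}

# Toy predictions standing in for predictRF, predictSVM, predictGBM:
rf  <- c("A", "B", "C")
svm <- c("A", "B", "D")
gbm <- c("A", "E", "C")
majority_vote(rf, svm, gbm)  # "A" "B" "C"
```

With Random Forest already above 99% accuracy, such a vote would mostly echo its predictions; the fallback-to-first-model rule effectively privileges the strongest classifier on three-way disagreements.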